{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# [19、Transformer模型Encoder原理精讲及其PyTorch逐行实现](https://www.bilibili.com/video/BV1cP4y1V7GF)\n",
    "\n",
    "## 难点\n",
    "\n",
    "- word embedding\n",
    "- position embedding\n",
    "- encode self-attention mask\n",
    "- intra-attention mask\n",
    "- decoder self-attention mask\n",
    "- multi-head self-attention\n",
    "\n",
    "项目推荐：\n",
    "- vit\n",
    "- swing transformer\n",
    "- nlp：预训练\n",
    "- huggingface"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![image-20250624205950984](http://assets.hypervoid.top/img/2025/06/24/image-20250624205950984-f71b.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "import numpy as np\n",
    "import torch.nn as nn\n",
    "import torch.nn.functional as F\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Word Embedding\n",
    "\n",
    "以序列建模为例子考虑 source sentence 和 target sentence "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "src_seq =  tensor([[4, 2, 0, 0],\n",
      "        [5, 5, 1, 4]], dtype=torch.int32)\n",
      "tgt_seq =  tensor([[1, 7, 3, 7],\n",
      "        [5, 2, 6, 0]], dtype=torch.int32)\n",
      "emb_table= Parameter containing:\n",
      "tensor([[-0.1615,  1.0848, -0.7028,  1.0415, -0.4725, -1.1625,  1.0262,  1.0964],\n",
      "        [-1.0604, -0.1099, -0.7894, -0.8810, -0.2279,  0.7442,  1.2569,  1.0380],\n",
      "        [-0.4024,  0.8135, -0.9275, -1.3547, -1.0342,  0.7953,  0.0574, -0.0495],\n",
      "        [ 0.2147,  0.1891,  0.4006,  0.7699,  1.3729,  0.3016, -1.2966,  0.4621],\n",
      "        [-1.3259,  1.9471, -1.2459, -0.4742, -0.2590, -1.0082, -0.3390, -0.4506],\n",
      "        [-2.2342,  0.8628,  0.0396, -0.4539,  0.0152, -0.9146, -0.7273,  0.8294],\n",
      "        [ 0.5619, -0.8959, -0.5630, -1.0634,  0.3173,  0.7419,  2.1825,  0.9669],\n",
      "        [ 0.6489,  0.3962,  0.9375,  0.2318,  0.7550, -0.4951,  0.0103, -0.0439],\n",
      "        [ 0.4689, -0.6901, -1.6390, -1.0486,  0.0407,  1.2266,  0.1206, -1.2306]],\n",
      "       requires_grad=True)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "(tensor([[[-1.3259,  1.9471, -1.2459, -0.4742, -0.2590, -1.0082, -0.3390,\n",
       "           -0.4506],\n",
       "          [-0.4024,  0.8135, -0.9275, -1.3547, -1.0342,  0.7953,  0.0574,\n",
       "           -0.0495],\n",
       "          [-0.1615,  1.0848, -0.7028,  1.0415, -0.4725, -1.1625,  1.0262,\n",
       "            1.0964],\n",
       "          [-0.1615,  1.0848, -0.7028,  1.0415, -0.4725, -1.1625,  1.0262,\n",
       "            1.0964]],\n",
       " \n",
       "         [[-2.2342,  0.8628,  0.0396, -0.4539,  0.0152, -0.9146, -0.7273,\n",
       "            0.8294],\n",
       "          [-2.2342,  0.8628,  0.0396, -0.4539,  0.0152, -0.9146, -0.7273,\n",
       "            0.8294],\n",
       "          [-1.0604, -0.1099, -0.7894, -0.8810, -0.2279,  0.7442,  1.2569,\n",
       "            1.0380],\n",
       "          [-1.3259,  1.9471, -1.2459, -0.4742, -0.2590, -1.0082, -0.3390,\n",
       "           -0.4506]]], grad_fn=<EmbeddingBackward0>),\n",
       " torch.Size([2, 4, 8]))"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 构建序列，序列字符以在词表中索引的形式\n",
    "from torch import int32\n",
    "\n",
    "num_words = 8 # 单词表大小\n",
    "model_dim = 8 # 模型特征大小\n",
    "batch_size = 2\n",
    "# 代表句子长度\n",
    "src_len, tgt_len = [2,4], [4,3]\n",
    "# src_seq = [torch.randint(1, num_words, (L, )) for L in src_len]\n",
    "src_seq = torch.zeros((batch_size, max(src_len)), dtype=int32)\n",
    "tgt_seq = torch.zeros((batch_size, max(tgt_len)), dtype=int32)\n",
    "for i in range(batch_size):\n",
    "    _len = int(src_len[i])\n",
    "    src_seq[i, :_len] = torch.randint(1, num_words, (_len,))\n",
    "    _len = int(tgt_len[i])\n",
    "    tgt_seq[i, :_len] = torch.randint(1, num_words, (_len,))\n",
    "\n",
    "src_emb_table = nn.Embedding(num_words+1, model_dim)\n",
    "tgt_emb_table = nn.Embedding(num_words+1, model_dim)\n",
    "\n",
    "# src_len, tgt_len\n",
    "print(\"src_seq = \", src_seq)\n",
    "print(\"tgt_seq = \", tgt_seq)\n",
    "print(\"emb_table=\", src_emb_table.weight)\n",
    "\n",
    "# 构造 word embedding\n",
    "src_emb = src_emb_table(src_seq) # 就是算出每个label对应的向量\n",
    "tgt_emb = tgt_emb_table(src_seq)\n",
    "src_emb, src_emb.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Position Embedding\n",
    "\n",
    "Transformer没有位置假设和全局假设，因此需要位置信息。\n",
    "\n",
    "$$\n",
    "PE(pos, 2i) = \\sin(\\frac{pos}{10000^{2i/d_{\\text{model}}}})\\\\\n",
    "PE(pos, 2i+1) = \\cos(\\frac{pos}{10000^{2i/d_{\\text{model}}}})\n",
    "$$\n",
    "\n",
    "使用Sin和Cos组合的位置编码，泛化能力比较强，具有对称性，具有唯一性。\n",
    "- 唯一性：每个位置的编码独一无二\n",
    "- 编码值在不同维度上，有不同的变化频率\n",
    "- 相对位置的线性表示：可以将 PE(pos+k, 2i) 和 PE(pos+k, 2i+1) 表示成 PE(pos, 2i) 和 PE(pos, 2i+1) 的线性组合。这等价于将位置 pos 的编码向量乘以一个不依赖于 pos 而只依赖于 k 的旋转矩阵.\n",
    "    - 因此模型不需要去学习每个绝对位置的含义，只需要学习这个“旋转”操作的含义。例如，模型可以学会一个通用的变换，来关注“前面第k个词”或“后面第k个词”，无论当前的绝对位置 pos 是多少。这使得模型能够轻松地推断出相对位置关系，这对于理解语言至关重要。\n",
    "\n",
    "\n",
    "$$\n",
    "\\sin(A+B) = \\sin(A)\\cos(B) + \\cos(A)\\sin(B) \\\\\n",
    "\\cos(A+B) = \\cos(A)\\cos(B) - \\sin(A)\\sin(B)\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(torch.Size([2, 4, 8]), torch.Size([2, 4, 8]))"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 构建pos embedding\n",
    "max_pos = 8\n",
    "pos = torch.arange(max_pos).reshape((-1, 1)) # 一列\n",
    "# 总model_dim\n",
    "idx = torch.arange(0, model_dim).reshape((1, -1)) # 一行\n",
    "inner = pos / torch.pow(10000, idx / model_dim)\n",
    "pos_emb = torch.empty_like(inner)\n",
    "inner = inner[:, 0::2] # 只取i的偶数部分\n",
    "pos_emb[:, 0::2] = torch.sin(inner)\n",
    "pos_emb[:, 1::2] = torch.cos(inner)\n",
    "pelayer = nn.Embedding(pos_emb.shape[0], pos_emb.shape[1])\n",
    "pelayer.weight = nn.Parameter(pos_emb, requires_grad=False)\n",
    "# 现在构造上面句子的 pos embedding\n",
    "\n",
    "src_pos = torch.arange(max(src_len)).repeat(batch_size, 1)\n",
    "tgt_pos = torch.arange(max(tgt_len)).repeat(batch_size, 1)\n",
    "src_pe = pelayer(src_pos)\n",
    "tgt_pe = pelayer(tgt_pos)\n",
    "src_pe.shape, tgt_pe.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## encoder self-attention mask\n",
    "\n",
    "用于加速，一次训练多对样本序列"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 为什么要scale？"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([0.0900, 0.2447, 0.6652])\n",
      "tensor([0.0024, 0.0473, 0.9503])\n",
      "tensor([4.5094e-05, 6.6925e-03, 9.9326e-01])\n"
     ]
    }
   ],
   "source": [
    "#  softmax 的马太效应\n",
    "\n",
    "arr = torch.arange(0, 3, dtype=torch.float32)\n",
    "print(F.softmax(arr, dim=0))\n",
    "print(F.softmax(arr * 3, dim=0))\n",
    "print(F.softmax(arr * 5, dim=0))\n",
    "# 在不同的 scale 下，softmax 出来的分布越来越尖锐。模型就不会更关心小一些的值了"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "雅可比矩阵角度，scale越大，雅可比矩阵反而越小，导致训练的时候梯度变小，模型难以训练。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([[ 0.0819, -0.0220, -0.0599],\n",
      "        [-0.0220,  0.1848, -0.1628],\n",
      "        [-0.0599, -0.1628,  0.2227]])\n",
      "tensor([[ 4.5092e-05, -3.0179e-07, -4.4790e-05],\n",
      "        [-3.0179e-07,  6.6478e-03, -6.6475e-03],\n",
      "        [-4.4790e-05, -6.6475e-03,  6.6923e-03]])\n",
      "tensor([[ 2.0611e-09, -9.3568e-14, -2.0610e-09],\n",
      "        [-9.3568e-14,  4.5396e-05, -4.5396e-05],\n",
      "        [-2.0610e-09, -4.5396e-05,  4.5417e-05]])\n",
      "tensor(0.9789) tensor(0.0268) tensor(0.0002)\n"
     ]
    }
   ],
   "source": [
    "# 雅可比角度\n",
    "from torch import Tensor\n",
    "from torch.autograd.functional import jacobian\n",
    "\n",
    "\n",
    "def softmax_f(arr):\n",
    "    return F.softmax(arr, dim=0)\n",
    "\n",
    "\n",
    "res1: Tensor = jacobian(softmax_f, arr)\n",
    "res2: Tensor = jacobian(softmax_f, arr * 5)\n",
    "res3: Tensor = jacobian(softmax_f, arr * 10)\n",
    "\n",
    "print(res1)\n",
    "print(res2)\n",
    "print(res3)\n",
    "print(res1.abs().sum(), res2.abs().sum(), res3.abs().sum())\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### mask"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([[4, 2, 0, 0],\n",
      "        [5, 5, 1, 4]], dtype=torch.int32)\n",
      "tensor([[1, 1, 0, 0],\n",
      "        [1, 1, 1, 1]])\n",
      "tensor([[[False, False,  True,  True],\n",
      "         [False, False,  True,  True],\n",
      "         [ True,  True,  True,  True],\n",
      "         [ True,  True,  True,  True]],\n",
      "\n",
      "        [[False, False, False, False],\n",
      "         [False, False, False, False],\n",
      "         [False, False, False, False],\n",
      "         [False, False, False, False]]])\n",
      "prob:  tensor([[[0.1279, 0.1505, 0.0000, 0.0000],\n",
      "         [0.1732, 0.2863, 0.0000, 0.0000],\n",
      "         [0.0000, 0.0000, 0.0000, 0.0000],\n",
      "         [0.0000, 0.0000, 0.0000, 0.0000]],\n",
      "\n",
      "        [[0.8721, 0.8495, 1.0000, 1.0000],\n",
      "         [0.8268, 0.7137, 1.0000, 1.0000],\n",
      "         [1.0000, 1.0000, 1.0000, 1.0000],\n",
      "         [1.0000, 1.0000, 1.0000, 1.0000]]]) torch.Size([2, 4, 4])\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_91144/4008299438.py:13: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.\n",
      "  prob = F.softmax(score)\n"
     ]
    }
   ],
   "source": [
    "# mask大小：[batch_size, max_src_len, max_src_len]，值 0/-inf\n",
    "print(src_seq)\n",
    "valid_encoder_pos = torch.where(src_seq != 0, 1, 0)\n",
    "print(valid_encoder_pos)\n",
    "valid_encoder_pos=valid_encoder_pos.unsqueeze_(2)\n",
    "valid_encoder_mat = torch.bmm(valid_encoder_pos, valid_encoder_pos.transpose(1,2))\n",
    "# 获得mask矩阵，代表能否建立关联性（和pad后的无关系）\n",
    "mask = (1-valid_encoder_mat).to(torch.bool)\n",
    "print(mask)\n",
    "# mask用法：\n",
    "score = torch.randn(mask.shape)\n",
    "score.masked_fill_(mask, -np.inf)\n",
    "prob = F.softmax(score)\n",
    "print(\"prob: \", prob, prob.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# [20、Transformer模型Decoder原理精讲及其PyTorch逐行实现](https://www.bilibili.com/video/BV1Qg411N74v?spm_id_from=333.788.player.switch&vd_source=cdd897fffb54b70b076681c3c4e4d45d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 位置编码的泛化能力\n",
    "\n",
    "假如说我们训练时候使用512的长度，但是在实际使用的时候遇到了1000 的长度，这时候可以通过数学公式推导出对应的位置编码。\n",
    "$$\n",
    "M. \\begin{bmatrix} \\sin(\\omega_k. t) \\\\ \\cos(\\omega_k. t) \\end{bmatrix} = \\begin{bmatrix} \\sin(\\omega_k. (t + \\phi)) \\\\ \\cos(\\omega_k. (t + \\phi)) \\end{bmatrix}\n",
    "$$\n",
    "\n",
    "这里 $t+\\phi$ 超过了限度，但是假如我们可以找到一个矩阵 $M$，使得：\n",
    "$$\n",
    "\\begin{bmatrix} u_1 & v_1 \\\\ u_2 & v_2 \\end{bmatrix} \\begin{bmatrix} \\sin(\\omega_k . t) \\\\ \\cos(\\omega_k . t) \\end{bmatrix} = \\begin{bmatrix} \\sin(\\omega_k . (t + \\phi)) \\\\ \\cos(\\omega_k . (t + \\phi)) \\end{bmatrix}\n",
    "$$\n",
    "\n",
    "然后借助和差化积公式解方程：\n",
    "\n",
    "$$\n",
    "\\begin{bmatrix} u_1 & v_1 \\\\ u_2 & v_2 \\end{bmatrix} \\begin{bmatrix} \\sin(\\omega_k . t) \\\\ \\cos(\\omega_k . t) \\end{bmatrix} = \\begin{bmatrix} \\sin(\\omega_k . t) \\cos(\\omega_k . \\phi) + \\cos(\\omega_k . t) \\sin(\\omega_k . \\phi) \\\\ \\cos(\\omega_k . t) \\cos(\\omega_k . \\phi) - \\sin(\\omega_k . t) \\sin(\\omega_k . \\phi) \\end{bmatrix}\n",
    "$$\n",
    "\n",
    "$$\n",
    "u_1 \\sin(\\omega_k . t) + v_1 \\cos(\\omega_k . t) = \\cos(\\omega_k . \\phi) \\sin(\\omega_k . t) + \\sin(\\omega_k . \\phi) \\cos(\\omega_k . t) \\quad (1) \\\\\n",
    "u_2 \\sin(\\omega_k . t) + v_2 \\cos(\\omega_k . t) = - \\sin(\\omega_k . \\phi) \\sin(\\omega_k . t) + \\cos(\\omega_k . \\phi) \\cos(\\omega_k . t) \\quad (2)\n",
    "$$\n",
    "\n",
    "解得：\n",
    "$$\n",
    "u_1 = \\cos(\\omega_k . \\phi) \\quad v_1 = \\sin(\\omega_k . \\phi) \\\\\n",
    "u_2 = - \\sin(\\omega_k . \\phi) \\quad v_2 = \\cos(\\omega_k . \\phi)\n",
    "$$\n",
    "\n",
    "\n",
    "于是得到矩阵 $M$ 就是：\n",
    "\n",
    "$$\n",
    "M_{\\phi,k} = \\begin{bmatrix} \\cos(\\omega_k . \\phi) & \\sin(\\omega_k . \\phi) \\\\ - \\sin(\\omega_k . \\phi) & \\cos(\\omega_k . \\phi) \\end{bmatrix}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Intra Attention Mask\n",
    "\n",
    "计算 $Q \\times K^T$, `shape = (batch_size, target_seq_len, src_seq_len)`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "每个batch下，代表 decoder[i] 对 encoder[j] 有效性 （是否需要计算相关性）\n",
      "tensor([[[1, 1, 0, 0],\n",
      "         [1, 1, 0, 0],\n",
      "         [1, 1, 0, 0],\n",
      "         [1, 1, 0, 0]],\n",
      "\n",
      "        [[1, 1, 1, 1],\n",
      "         [1, 1, 1, 1],\n",
      "         [1, 1, 1, 1],\n",
      "         [0, 0, 0, 0]]])\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "tensor([[[False, False,  True,  True],\n",
       "         [False, False,  True,  True],\n",
       "         [False, False,  True,  True],\n",
       "         [False, False,  True,  True]],\n",
       "\n",
       "        [[False, False, False, False],\n",
       "         [False, False, False, False],\n",
       "         [False, False, False, False],\n",
       "         [ True,  True,  True,  True]]])"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 1 代表是不是真的单词，不是padding的\n",
    "valid_encoder_pos = torch.where(src_seq != 0, 1, 0).unsqueeze(2)\n",
    "# print(valid_encoder_pos.shape, valid_encoder_pos, sep='\\n')\n",
    "\n",
    "valid_decoder_pos = torch.where(tgt_seq != 0, 1, 0).unsqueeze(2)\n",
    "# print(valid_decoder_pos.shape, valid_decoder_pos, sep='\\n')\n",
    "\n",
    "# 2,4,1 @ 2,1,4 => 2,4,4\n",
    "valid_cross_pos = torch.bmm(valid_decoder_pos, valid_encoder_pos.transpose(1,2))\n",
    "print(\"每个batch下，代表 decoder[i] 对 encoder[j] 有效性 （是否需要计算相关性）\")\n",
    "print(valid_cross_pos)\n",
    "mask_cross_attention = (1-valid_cross_pos).to(torch.bool)\n",
    "mask_cross_attention"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Decoder self-attention-mask\n",
    "\n",
    "decoder是一个自回归（Auto-Regressive）的模型，不是一次性输出所有结构，一次次地预测结果，可以理解为一个条件概率的生成过程。\n",
    "在训练阶段，一次性送入所有数据，但是不能让模型知道所有结果，需要让模型自己预测结果。\n",
    "\n",
    "decoder得mask是一个上三角矩阵"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[tensor([[1., 0., 0., 0.],\n",
       "         [1., 1., 0., 0.],\n",
       "         [1., 1., 1., 0.],\n",
       "         [1., 1., 1., 1.]]),\n",
       " tensor([[1., 0., 0.],\n",
       "         [1., 1., 0.],\n",
       "         [1., 1., 1.]])]"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[torch.tril(torch.ones((L,L))) for L in tgt_len]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "dl",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
